Abstract. We study zero-sum stochastic differential games with player dynamics governed
by a nondegenerate controlled diffusion process. Under the assumption of uniform
stability, we establish the existence of a solution to the Isaacs equation for the ergodic
game and characterize the optimal stationary strategies. The data is not assumed to be
bounded, nor do we assume geometric ergodicity. Thus our results extend previous work
in the literature. We also study a relative value iteration scheme that takes the form of a
parabolic Isaacs equation. Under the hypothesis of geometric ergodicity we show that the
relative value iteration converges to a solution of the elliptic Isaacs equation as time goes
to infinity. We use these results to establish convergence of the relative value iteration for
risk-sensitive control problems under an asymptotic flatness assumption.



1. Introduction 

In this paper we consider a relative value iteration for zero-sum stochastic differential
games. This relative value iteration was introduced in [1] for stochastic control, and we follow
the method introduced in that paper.

In Section 2, we prove the existence of a solution to the Isaacs equation corresponding to
the ergodic zero-sum stochastic differential game. We do not assume that the data or the
running payoff function is bounded, nor do we assume geometric ergodicity, so our results
extend the work in [3]. In Section 3, we introduce a relative value iteration scheme for
the zero-sum stochastic differential game and prove its convergence under a hypothesis of
geometric ergodicity. In Section 4, we apply the results from Section 3 and study a value
iteration scheme for risk-sensitive control under an asymptotic flatness assumption.

2. Problem Description 

We consider zero-sum stochastic differential games with state dynamics modeled by a
controlled nondegenerate diffusion process $X = \{X(t) : 0 \le t < \infty\}$, and subject to a
long-term average payoff criterion.

2.1. State dynamics. Let $U_i$, $i = 1, 2$, be compact metric spaces and $V_i = \mathcal{P}(U_i)$ denote
the space of all probability measures on $U_i$ with the Prohorov topology. Let
$\bar b : \mathbb{R}^d \times U_1 \times U_2 \to \mathbb{R}^d$ and $\sigma : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ be measurable functions.
Assumptions on $\bar b$ and $\sigma$ will be specified later.
Define $b : \mathbb{R}^d \times V_1 \times V_2 \to \mathbb{R}^d$ as

$$ b(x, v_1, v_2) := \int_{U_2}\int_{U_1} \bar b(x, u_1, u_2)\, v_1(du_1)\, v_2(du_2) , $$

for $x \in \mathbb{R}^d$, $v_1 \in V_1$ and $v_2 \in V_2$. We model the controlled diffusion process $X$ via the Itô
s.d.e.

$$ dX(t) = b(X(t), v_1(t), v_2(t))\,dt + \sigma(X(t))\,dW(t). \tag{2.1} $$

All processes in (2.1) are defined in a common probability space $(\Omega, \mathfrak{F}, P)$ which is assumed
to be complete. The process $W = \{W(t) : 0 \le t < \infty\}$ is an $\mathbb{R}^d$-valued standard Wiener
process which is independent of the initial condition $X_0$ of (2.1). Player $i$, with $i = 1, 2$,
controls the dynamics $X$ through her strategy $v_i(\cdot)$, a $V_i$-valued process which is jointly
measurable in $(t, \omega) \in [0, \infty) \times \Omega$ and non-anticipative, i.e., for $s < t$, $W(t) - W(s)$ is
independent of

$$ \mathfrak{F}_s := \text{the completion of } \sigma(X_0, v_1(r), v_2(r), W(r),\ r \le s) . $$

We denote the set of all such controls (admissible controls) for player $i$ by $\mathcal{U}_i$, $i = 1, 2$.

Assumptions on the Data: We assume the following conditions on the coefficients $\bar b$ and $\sigma$
to ensure existence of a unique solution to (2.1).

(A1) The functions $\bar b$ and $\sigma$ are locally Lipschitz continuous in $x \in \mathbb{R}^d$, uniformly over
$(u_1, u_2) \in U_1 \times U_2$, and have at most a linear growth rate in $x \in \mathbb{R}^d$. Also $\bar b$ is
continuous.

(A2) For each $R > 0$ there exists a constant $k(R) > 0$ such that

$$ z^T a(x) z \ge k(R)\,\|z\|^2 \quad \text{for all } \|x\| \le R \text{ and } z \in \mathbb{R}^d , $$

where $a := \sigma\sigma^T$, with $T$ denoting the transpose.
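Definition 2.1. For $f \in C^2(\mathbb{R}^d)$ define

$$ \bar L f(x, u_1, u_2) := \bar b(x, u_1, u_2) \cdot \nabla f(x) + \tfrac12 \operatorname{tr}\bigl(a(x)\nabla^2 f(x)\bigr) $$

for $x \in \mathbb{R}^d$ and $(u_1, u_2) \in U_1 \times U_2$. Also define the relaxed extended controlled generator $L$
by

$$ L f(x, v_1, v_2) := \int_{U_2}\int_{U_1} \bar L f(x, u_1, u_2)\, v_1(du_1)\, v_2(du_2) , \qquad f \in C^2(\mathbb{R}^d) , $$

for $x \in \mathbb{R}^d$ and $(v_1, v_2) \in V_1 \times V_2$.

We denote the set of all stationary Markov strategies of player $i$ by $\mathcal{M}_i$, $i = 1, 2$.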

2.2. Zero-sum ergodic game. Let $\bar r : \mathbb{R}^d \times U_1 \times U_2 \to [0, \infty)$ be a continuous function,
which is also locally Lipschitz continuous in its first argument. We define the relaxed running
payoff function $r : \mathbb{R}^d \times V_1 \times V_2 \to [0, \infty)$ by

$$ r(x, v_1, v_2) := \int_{U_2}\int_{U_1} \bar r(x, u_1, u_2)\, v_1(du_1)\, v_2(du_2) . $$

Player 1 seeks to maximize the average payoff given by

$$ \liminf_{T \to \infty} \frac{1}{T}\, E_x\left[ \int_0^T r(X(t), v_1(t), v_2(t))\,dt \right] \tag{2.2} $$

over all admissible controls $v_1 \in \mathcal{U}_1$, while Player 2 seeks to minimize (2.2) over all
$v_2 \in \mathcal{U}_2$. Here $E_x$ is the expectation operator corresponding to the probability measure on the
canonical space of the process starting at $X(0) = x$.
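To make the criterion (2.2) concrete, the following minimal sketch simulates a one-dimensional instance of (2.1) by the Euler-Maruyama method and returns the time-averaged payoff along the path. Everything model-specific here is an assumption made for illustration and is not taken from the paper: the drift -x + u1 - u2, the constant diffusion coefficient, the payoff x^2, the step size, and the restriction to precise (non-relaxed) stationary Markov strategies given as plain functions of the state.

```python
import math
import random

SIGMA = 1.0  # assumed constant diffusion coefficient (illustration only)


def drift(x, u1, u2):
    """Assumed drift b(x, u1, u2) = -x + u1 - u2 with actions in [-1, 1]."""
    return -x + u1 - u2


def payoff(x, u1, u2):
    """Assumed nonnegative running payoff r(x, u1, u2) = x^2."""
    return x * x


def simulate_average_payoff(x0, v1, v2, horizon=200.0, dt=0.01, seed=0):
    """Euler-Maruyama path of dX = b dt + sigma dW under feedback strategies
    v1, v2; returns (time-averaged payoff over [0, horizon], final state)."""
    rng = random.Random(seed)
    steps = int(round(horizon / dt))
    x, total = x0, 0.0
    for _ in range(steps):
        u1, u2 = v1(x), v2(x)
        total += payoff(x, u1, u2) * dt  # left-endpoint Riemann sum of the payoff
        x += drift(x, u1, u2) * dt + SIGMA * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return total / horizon, x


if __name__ == "__main__":
    avg, _ = simulate_average_payoff(0.0, lambda x: 1.0, lambda x: 0.0)
    print(f"time-averaged payoff over [0, 200]: {avg:.3f}")
```

With the constant strategies used in the example, the state is an Ornstein-Uhlenbeck process with stationary law N(1, 1/2), so the exact ergodic average of x^2 is 1.5 and the printed estimate should be close to that value, up to sampling and discretization error.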

Assumptions on Ergodicity: We consider the following ergodicity assumptions:

(A3) There exist a nonnegative inf-compact function $\mathcal{V} \in C^2(\mathbb{R}^d)$ and positive constants
$k_0$, $k_1$ and $c$ such that

$$ \bar L \mathcal{V}(x, u_1, u_2) \le k_0 - 2k_1 \mathcal{V}(x) , \qquad
   \max_{u_1 \in U_1,\, u_2 \in U_2} \bar r(x, u_1, u_2) \le c\,\mathcal{V}(x) $$

for all $(u_1, u_2) \in U_1 \times U_2$, and $x \in \mathbb{R}^d$.

(A3') There exist nonnegative inf-compact functions $\mathcal{V} \in C^2(\mathbb{R}^d)$ and $h \in C(\mathbb{R}^d)$, and
positive constants $k_0$ and $c$ such that

$$ \bar L \mathcal{V}(x, u_1, u_2) \le k_0 - h(x) , \qquad
   \max_{u_1 \in U_1,\, u_2 \in U_2} \bar r(x, u_1, u_2) \le c\,h(x) $$

for all $(u_1, u_2) \in U_1 \times U_2$, and $x \in \mathbb{R}^d$. Also,

$$ \frac{\max_{u_1 \in U_1,\, u_2 \in U_2} \bar r(x, u_1, u_2)}{h(x)} \longrightarrow 0 \quad \text{as } \|x\| \to \infty . $$

In this section we use assumption (A3'), while in Section 3 we employ (A3) which is
stronger and equivalent to geometric ergodicity in the time-homogeneous Markov case.
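As a sanity check on how (A3) is verified in practice, here is a short worked computation for a hypothetical one-dimensional model. The model (drift, action sets, Lyapunov function and payoff) is an assumption chosen for illustration and does not appear in the paper.

```latex
% Hypothetical model (illustration only): d = 1, U_1 = U_2 = [-1,1],
% \bar b(x,u_1,u_2) = -x + u_1 - u_2, \sigma \equiv 1, \mathcal{V}(x) = 1 + x^2.
\begin{align*}
\bar L \mathcal{V}(x,u_1,u_2)
  &= 2x\,(-x + u_1 - u_2) + 1 \\
  &\le -2x^2 + 4|x| + 1 \\
  &\le -x^2 + 5 && \text{since } 4|x| \le x^2 + 4 \\
  &= 6 - \mathcal{V}(x).
\end{align*}
% Hence (A3) holds with k_0 = 6 and 2k_1 = 1, and a running payoff such as
% \bar r(x,u_1,u_2) = x^2 satisfies \bar r \le c\,\mathcal{V} with c = 1.
```

The same function also satisfies the first inequality of (A3') with h = \mathcal{V}, but the payoff x^2 would then violate the growth condition \bar r / h \to 0, which shows that the two sets of hypotheses constrain the payoff differently.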
We start with a theorem which characterizes the value of the game under a discounted
infinite horizon criterion. For this we need the following notation: For a continuous function
$\mathcal{V} : \mathbb{R}^d \to (0, \infty)$, $C_{\mathcal{V}}(\mathbb{R}^d)$ denotes the Banach space of functions in $C(\mathbb{R}^d)$ with norm

$$ \|f\|_{\mathcal{V}} := \sup_{x \in \mathbb{R}^d} \frac{|f(x)|}{\mathcal{V}(x)} . $$



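Theorem 2.1. Assume (A1), (A2) and (A3'). For $\alpha > 0$, there exists a solution
$\psi_\alpha \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$ to the p.d.e.

$$ \alpha\psi_\alpha(x) = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\psi_\alpha(x, v_1, v_2) + r(x, v_1, v_2) \bigr]
   = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha(x, v_1, v_2) + r(x, v_1, v_2) \bigr] \tag{2.3} $$

and is characterized by

$$ \psi_\alpha(x) = \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right]
   = \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right] . $$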



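Proof. Let $B_R$ denote the open ball of radius $R$ centered at the origin in $\mathbb{R}^d$. The p.d.e.

$$ \alpha\psi_\alpha^R(x) = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\psi_\alpha^R(x, v_1, v_2) + r(x, v_1, v_2) \bigr] , \qquad
   \psi_\alpha^R = 0 \ \text{on } \partial B_R \tag{2.4} $$

has a unique solution $\psi_\alpha^R$ in $C^2(B_R) \cap C(\bar B_R)$, see [6, Theorem 15.12, p. 382]. Since

$$ \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\psi_\alpha^R(x, v_1, v_2) + r(x, v_1, v_2) \bigr]
   = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha^R(x, v_1, v_2) + r(x, v_1, v_2) \bigr] , $$

it follows that $\psi_\alpha^R \in C^2(B_R) \cap C(\bar B_R)$ is also a solution to

$$ \alpha\psi_\alpha^R(x) = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha^R(x, v_1, v_2) + r(x, v_1, v_2) \bigr] , \qquad
   \psi_\alpha^R = 0 \ \text{on } \partial B_R . \tag{2.5} $$

Let $v_{1\alpha}^R : B_R \to V_1$ be a measurable selector for the maximizer in (2.5) and $v_{2\alpha}^R : B_R \to V_2$
be a measurable selector for the minimizer in (2.4). Then the p.d.e.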



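$$ \alpha\varphi(x) = \min_{v_2 \in V_2} \bigl[ L\varphi(x, v_{1\alpha}^R(x), v_2) + r(x, v_{1\alpha}^R(x), v_2) \bigr] , \qquad
   \varphi = 0 \ \text{on } \partial B_R $$

has a unique solution in $C^{2,r}(B_R) \cap C(\bar B_R)$, $0 < r < 1$, which by (2.5) coincides with $\psi_\alpha^R$.
By a routine application of Itô's formula, it follows that

$$ \psi_\alpha^R(x) = \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_{1\alpha}^R(X(t)), v_2(t))\,dt \right] , \tag{2.6} $$

where

$$ \tau_R := \inf\,\{ t \ge 0 : \|X(t)\| \ge R \} $$

and $X$ is the solution to (2.1) corresponding to the control pair $(v_{1\alpha}^R, v_2)$, with $v_2 \in \mathcal{U}_2$.

Repeating the above argument with the outer minimizer $v_{2\alpha}^R$ of (2.4), we similarly obtain

$$ \psi_\alpha^R(x) = \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_{2\alpha}^R(X(t)))\,dt \right] . \tag{2.7} $$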



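Combining (2.6) and (2.7), we obtain

$$ \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right]
   \le \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right] , $$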

RELATIVE VALUE ITERATION FOR STOCHASTIC DIFFERENTIAL GAMES 



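which implies that

$$ \psi_\alpha^R(x) = \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right]
   = \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right] . $$

It is evident that $\psi_\alpha^R(x) \le \hat\psi_\alpha(x)$, $x \in \mathbb{R}^d$, where

$$ \hat\psi_\alpha(x) := \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right] , \qquad x \in \mathbb{R}^d . $$

Also $\psi_\alpha^R$ is nondecreasing in $R$. By Assumption (A3'), it follows that

$$ \hat\psi_\alpha(x) \le c\, E_x\left[ \int_0^\infty e^{-\alpha t} h(X(t))\,dt \right] , $$

where $X$ is a solution to (2.1) corresponding to some stationary Markov control pair. Since
the function $x \mapsto E_x\bigl[ \int_0^\infty e^{-\alpha t} h(X(t))\,dt \bigr]$ is continuous, it follows that $\hat\psi_\alpha \in L^p_{\mathrm{loc}}(\mathbb{R}^d)$ for
$1 < p < \infty$.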

By Beneš' measurable selection theorem [3] there exists a pair of controls $(v_{1\alpha}^R, v_{2\alpha}^R) \in
\mathcal{M}_1 \times \mathcal{M}_2$ which realizes the minimax in (2.4)-(2.5), i.e., for all $x \in B_R$ the following holds:

$$ \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha^R(x, v_1, v_2) + r(x, v_1, v_2) \bigr]
   = L\psi_\alpha^R(x, v_{1\alpha}^R(x), v_{2\alpha}^R(x)) + r(x, v_{1\alpha}^R(x), v_{2\alpha}^R(x)) . $$

Hence $\psi_\alpha^R \in C^2(B_R) \cap C(\bar B_R)$ is a solution to

$$ \alpha\psi_\alpha^R(x) = L\psi_\alpha^R(x, v_{1\alpha}^R(x), v_{2\alpha}^R(x)) + r(x, v_{1\alpha}^R(x), v_{2\alpha}^R(x)) , \qquad x \in B_R . $$

Hence by [2, Lemma A.2.5, p. 305], for each $1 < p < \infty$ and $R' > 2R$, we have

$$ \|\psi_\alpha^{R'}\|_{W^{2,p}(B_R)}
   \le K_1 \Bigl( \|\psi_\alpha^{R'}\|_{L^p(B_{2R})} + \|L\psi_\alpha^{R'} - \alpha\psi_\alpha^{R'}\|_{L^p(B_{2R})} \Bigr)
   \le K_1 \Bigl( \|\hat\psi_\alpha\|_{L^p(B_{2R})} + K_2(R) \Bigr) , $$

where $K_1 > 0$ is a constant independent of $R'$ and $K_2(R)$ is a constant depending only
on the bound of $r$ on $B_{2R}$. Using standard approximation arguments involving Sobolev
imbedding theorems, see [2, p. 111], it follows that there exists $\psi_\alpha \in W^{2,p}_{\mathrm{loc}}(\mathbb{R}^d)$ such that
$\psi_\alpha^R \uparrow \psi_\alpha$ as $R \uparrow \infty$ and $\psi_\alpha$ is a solution to

$$ \alpha\psi_\alpha(x) = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha(x, v_1, v_2) + r(x, v_1, v_2) \bigr] . $$

By standard regularity arguments, see [2, p. 109], one can show that $\psi_\alpha \in C^{2,r}(\mathbb{R}^d)$,
$0 < r < 1$. Also using the minimax condition, it follows that $\psi_\alpha \in C^{2,r}(\mathbb{R}^d)$, $0 < r < 1$, is a
solution to

$$ \alpha\psi_\alpha(x) = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\psi_\alpha(x, v_1, v_2) + r(x, v_1, v_2) \bigr]
   = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\psi_\alpha(x, v_1, v_2) + r(x, v_1, v_2) \bigr] . $$

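Let $v_1^\alpha \in \mathcal{M}_1$ and $v_2^\alpha \in \mathcal{M}_2$ be an outer maximizing and an outer minimizing selector for
(2.3), respectively, corresponding to $\psi_\alpha$ given above. Then $\psi_\alpha$ satisfies the p.d.e.

$$ \alpha\psi_\alpha(x) = \max_{v_1 \in V_1} \bigl[ L\psi_\alpha(x, v_1, v_2^\alpha(x)) + r(x, v_1, v_2^\alpha(x)) \bigr] . $$

For $v_1 \in \mathcal{U}_1$, let $X$ be the solution to (2.1) corresponding to $(v_1, v_2^\alpha)$ and the initial condition
$x \in \mathbb{R}^d$. Applying the Itô-Dynkin formula, we obtain

$$ E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R)) \bigr] - \psi_\alpha(x)
   \le - E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2^\alpha(X(t)))\,dt \right] . $$

Since $\psi_\alpha \ge 0$, we have

$$ \psi_\alpha(x) \ge E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1(t), v_2^\alpha(X(t)))\,dt \right] . $$

Using Fatou's lemma we obtain

$$ \psi_\alpha(x) \ge E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2^\alpha(X(t)))\,dt \right] . \tag{2.8} $$

Therefore

$$ \psi_\alpha(x) \ge \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2^\alpha(X(t)))\,dt \right] . \tag{2.9} $$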



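Similarly, for $v_2 \in \mathcal{U}_2$, let $X$ be the solution to (2.1) corresponding to $(v_1^\alpha, v_2)$ and the initial
condition $x \in \mathbb{R}^d$. By applying the Itô-Dynkin formula, we obtain

$$ E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R)) \bigr] - \psi_\alpha(x)
   \ge - E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right] . $$

Hence

$$ \psi_\alpha(x) \le E_x\left[ \int_0^{\tau_R} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right]
   + E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R)) \bigr] . $$

By [2, Remark A.3.8, p. 310], it follows that

$$ \lim_{R \uparrow \infty} E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R)) \bigr] = 0 . $$

Hence, we have

$$ \psi_\alpha(x) \le E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right] . \tag{2.10} $$

Therefore

$$ \psi_\alpha(x) \le \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right] . \tag{2.11} $$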
(2.11) 
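By (2.9) and (2.11), we obtain

$$ \psi_\alpha(x) = E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2^\alpha(X(t)))\,dt \right] . \tag{2.12} $$

Also by (2.8) and (2.10) we have

$$ \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right]
   \le \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(t), v_2(t))\,dt \right] . $$

This implies the desired characterization. □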



Remark 2.1. Using Theorem 2.1 one can easily show that any pair of measurable outer
maximizing and outer minimizing selectors of (2.3) is a saddle point equilibrium for the
stochastic differential game with state dynamics given by (2.1) and with a discounted
criterion under the running payoff function $r$.

Theorem 2.2. Assume (A1), (A2) and (A3'). Then there exists a solution $(\beta, \varphi^*) \in
\mathbb{R} \times \bigl( C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d) \bigr)$ to the Isaacs equation

$$ \beta = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\varphi^*(x, v_1, v_2) + r(x, v_1, v_2) \bigr]
   = \max_{v_1 \in V_1} \min_{v_2 \in V_2} \bigl[ L\varphi^*(x, v_1, v_2) + r(x, v_1, v_2) \bigr] , \qquad
   \varphi^*(0) = 0 , \tag{2.13} $$

such that $\beta$ is the value of the game.
Proof. For $(v_1, v_2) \in \mathcal{M}_1 \times \mathcal{M}_2$, define

$$ J_\alpha(x, v_1, v_2) := E_x\left[ \int_0^\infty e^{-\alpha t} r(X(t), v_1(X(t)), v_2(X(t)))\,dt \right] , \qquad x \in \mathbb{R}^d , $$

where $X$ is a solution to (2.1) corresponding to $(v_1, v_2) \in \mathcal{M}_1 \times \mathcal{M}_2$. Hence from (2.12), we
have

$$ \psi_\alpha(x) = J_\alpha(x, v_1^\alpha, v_2^\alpha) , $$

where $(v_1^\alpha, v_2^\alpha) \in \mathcal{M}_1 \times \mathcal{M}_2$ is a pair of measurable outer maximizing and outer minimizing
selectors of (2.3). Using (A3'), it is easy to see that $(v_1^\alpha, v_2^\alpha)$ is a pair of stable stationary
Markov controls. Hence by the arguments in the proof of [2, Theorem 3.7.4, pp. 128-131],
we have the following estimates:



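$$ \|\psi_\alpha - \psi_\alpha(0)\|_{W^{2,p}(B_R)} \le \frac{K_3}{\eta[v_1^\alpha, v_2^\alpha](B_R)}
   \left( \frac{\beta[v_1^\alpha, v_2^\alpha]}{\eta[v_1^\alpha, v_2^\alpha](B_R)}
   + \max_{(x, v_1, v_2) \in \bar B_{4R} \times V_1 \times V_2} r(x, v_1, v_2) \right) , \tag{2.14} $$

$$ \sup_{x \in B_R} \alpha\psi_\alpha(x) \le K_3
   \left( \frac{\beta[v_1^\alpha, v_2^\alpha]}{\eta[v_1^\alpha, v_2^\alpha](B_R)}
   + \max_{(x, v_1, v_2) \in \bar B_{4R} \times V_1 \times V_2} r(x, v_1, v_2) \right) , \tag{2.15} $$

where $\eta[v_1^\alpha, v_2^\alpha]$ is the unique invariant probability measure of the process (2.1) corresponding
to $(v_1^\alpha, v_2^\alpha)$ and

$$ \beta[v_1^\alpha, v_2^\alpha] := \int_{\mathbb{R}^d} r(x, v_1^\alpha(x), v_2^\alpha(x))\, \eta[v_1^\alpha, v_2^\alpha](dx) . $$

It follows from [2, Corollary 3.3.2, p. 97] that

$$ \sup_{\alpha > 0} \beta[v_1^\alpha, v_2^\alpha] < \infty . \tag{2.16} $$

Also from [2, (2.6.9a), p. 69 and (3.3.9), p. 97] it follows that

$$ \inf_{\alpha > 0} \eta[v_1^\alpha, v_2^\alpha](B_R) > 0 . \tag{2.17} $$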



Combining (2.14)-(2.17), we have

$$ \|\psi_\alpha - \psi_\alpha(0)\|_{W^{2,p}(B_R)} \le K_4 , \qquad
   \sup_{x \in B_R} \alpha\psi_\alpha(x) \le K_4 , \tag{2.18} $$

where $K_4 > 0$ is a constant independent of $\alpha > 0$.
Define

$$ \bar\psi_\alpha(x) := \psi_\alpha(x) - \psi_\alpha(0) , \qquad x \in \mathbb{R}^d . $$

In view of (2.18), one can use the arguments in [2, Lemma 3.5.4, pp. 108-109] to show
that along some sequence $\alpha_n \downarrow 0$, $\alpha_n \psi_{\alpha_n}(0)$ converges to a constant $\varrho$ and $\bar\psi_{\alpha_n}$ converges
uniformly on compact sets to a function $\psi \in C^2(\mathbb{R}^d)$, where the pair $(\varrho, \psi)$ is a solution to
the p.d.e.

$$ \varrho = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\psi(x, v_1, v_2) + r(x, v_1, v_2) \bigr] , \qquad \psi(0) = 0 . $$

Moreover, using the Isaacs condition, it follows that $(\varrho, \psi) \in \mathbb{R} \times C^2(\mathbb{R}^d)$ satisfies (2.13).

We claim that $\psi \in o(\mathcal{V})$, i.e., $\psi(x)/\mathcal{V}(x) \to 0$ as $\|x\| \to \infty$. To prove the claim let
$(\hat v_1, \hat v_2) \in \mathcal{M}_1 \times \mathcal{M}_2$ be a pair of measurable outer maximizing and outer minimizing selectors of (2.13)
corresponding to $\psi$. Let $X$ be the solution to (2.1) under the control $(\hat v_1, \hat v_2)$. Then by an
application of the Itô-Dynkin formula and the help of Fatou's lemma, we can show that for
all $x \in \mathbb{R}^d$

$$ \psi(x) \ge E_x\left[ \int_0^{\breve\tau_r} \bigl( r(X(t), \hat v_1(X(t)), \hat v_2(X(t))) - \varrho \bigr)\,dt \right]
   + \min_{\|y\| = r} \psi(y) , \tag{2.19} $$

where

$$ \breve\tau_r = \inf\,\{ t \ge 0 : \|X(t)\| \le r \} . $$

Let $v_1^\alpha \in \mathcal{M}_1$ be a measurable outer maximizing selector in (2.3). Then the function
$\psi_\alpha \in C^{2,r}(\mathbb{R}^d)$ given in Theorem 2.1 satisfies the p.d.e.

$$ \alpha\psi_\alpha = \min_{v_2 \in V_2} \bigl[ L\psi_\alpha(x, v_1^\alpha(x), v_2) + r(x, v_1^\alpha(x), v_2) \bigr] . \tag{2.20} $$
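Let $X$ be the solution to (2.1) under the control $(v_1^\alpha, v_2)$, with $v_2 \in \mathcal{U}_2$, and initial condition
$x \in \mathbb{R}^d$. Then by applying the Itô-Dynkin formula to $e^{-\alpha t}\psi_\alpha(X(t))$ and using (2.20), we
obtain

$$ E_x\bigl[ e^{-\alpha(\breve\tau_r \wedge \tau_R)} \psi_\alpha(X(\breve\tau_r \wedge \tau_R)) \bigr] - \psi_\alpha(x)
   \ge - E_x\left[ \int_0^{\breve\tau_r \wedge \tau_R} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right] , $$

which we write as

$$ \psi_\alpha(x) \le E_x\left[ \int_0^{\breve\tau_r \wedge \tau_R} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right]
   + E_x\bigl[ e^{-\alpha(\breve\tau_r \wedge \tau_R)} \psi_\alpha(X(\breve\tau_r \wedge \tau_R)) \bigr] . \tag{2.21} $$

Using [2, Remark A.3.8, p. 310], it follows that

$$ E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R))\, I\{\breve\tau_r \ge \tau_R\} \bigr]
   \le E_x\bigl[ e^{-\alpha\tau_R} \psi_\alpha(X(\tau_R)) \bigr] \xrightarrow[R \to \infty]{} 0 . \tag{2.22} $$

Hence from (2.21) and (2.22), we obtain

$$ \psi_\alpha(x) \le E_x\left[ \int_0^{\breve\tau_r} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right]
   + E_x\bigl[ e^{-\alpha\breve\tau_r} \psi_\alpha(X(\breve\tau_r)) \bigr] . $$

Therefore,

$$ \begin{aligned}
\bar\psi_\alpha(x)
 &\le E_x\left[ \int_0^{\breve\tau_r} e^{-\alpha t} r(X(t), v_1^\alpha(X(t)), v_2(t))\,dt \right]
    + E_x\bigl[ e^{-\alpha\breve\tau_r} \psi_\alpha(X(\breve\tau_r)) - \psi_\alpha(0) \bigr] \\
 &= E_x\left[ \int_0^{\breve\tau_r} e^{-\alpha t} \bigl( r(X(t), v_1^\alpha(X(t)), v_2(t)) - \varrho \bigr)\,dt \right]
    + E_x\bigl[ \psi_\alpha(X(\breve\tau_r)) - \psi_\alpha(0) \bigr] \\
 &\qquad + E_x\bigl[ \alpha^{-1} (1 - e^{-\alpha\breve\tau_r}) \bigl( \varrho - \alpha\psi_\alpha(X(\breve\tau_r)) \bigr) \bigr] \\
 &\le E_x\left[ \int_0^{\breve\tau_r} e^{-\alpha t} \bigl( r(X(t), v_1^\alpha(X(t)), v_2(t)) - \varrho \bigr)\,dt \right]
    + M(r) + E_x[\breve\tau_r] \sup_{\|y\| = r} |\varrho - \alpha\psi_\alpha(y)| \\
 &\le \sup_{v_1 \in \mathcal{M}_1} E_x\left[ \int_0^{\breve\tau_r} e^{-\alpha t} \bigl( r(X(t), v_1(X(t)), v_2(t)) - \varrho \bigr)\,dt \right]
    + M(r) + \sup_{\|y\| = r} |\varrho - \alpha\psi_\alpha(y)| \, \sup_{v_1 \in \mathcal{M}_1} E_x[\breve\tau_r]
\end{aligned} $$

for some nonnegative constant $M(r)$ such that $M(r) \to 0$ as $r \downarrow 0$. The equality uses
$\int_0^{\breve\tau_r} e^{-\alpha t}\,dt = \alpha^{-1}(1 - e^{-\alpha\breve\tau_r})$, and the inequality after it uses $\alpha^{-1}(1 - e^{-\alpha\breve\tau_r}) \le \breve\tau_r$.
Next from the definition of $\psi$, by letting $\alpha \downarrow 0$ along the sequence given in the proof of
Theorem 2.2 we obtain

$$ \psi(x) \le \sup_{v_1 \in \mathcal{M}_1} E_x\left[ \int_0^{\breve\tau_r} \bigl( r(X(t), v_1(X(t)), v_2(t)) - \varrho \bigr)\,dt \right] + M(r) . \tag{2.23} $$

By combining (2.19) and (2.23), the result follows by [2, Lemma 3.7.2, p. 125]. This
completes the proof of the claim.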






ARI ARAPOSTATHIS, VIVEK S. BORKAR, AND K. SURESH KUMAR 



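Let $(\hat v_1, \hat v_2) \in \mathcal{M}_1 \times \mathcal{M}_2$ be a pair of measurable outer maximizing and minimizing
selectors in (2.13) corresponding to $\psi$. Then $(\varrho, \psi)$ satisfies the p.d.e.

$$ \varrho = \max_{v_1 \in V_1} \bigl[ L\psi(x, v_1, \hat v_2(x)) + r(x, v_1, \hat v_2(x)) \bigr] . $$

Let $v_1 \in \mathcal{U}_1$ and $X$ be the process in (2.1) under the control $(v_1, \hat v_2)$ and initial condition
$x \in \mathbb{R}^d$. By applying the Itô-Dynkin formula, we obtain

$$ E_x\bigl[ \psi(X(t \wedge \tau_R)) \bigr] - \psi(x)
   \le - E_x\left[ \int_0^{t \wedge \tau_R} \bigl( r(X(s), v_1(s), \hat v_2(X(s))) - \varrho \bigr)\,ds \right] . $$

Hence

$$ \varrho\, t \ge E_x\left[ \int_0^{t \wedge \tau_R} r(X(s), v_1(s), \hat v_2(X(s)))\,ds \right]
   + E_x\bigl[ \psi(X(t \wedge \tau_R)) \bigr] - \psi(x) $$

for all $t \ge 0$. Using Fatou's lemma and [2, Lemma 3.7.2, p. 125], we obtain

$$ \varrho\, t \ge E_x\left[ \int_0^{t} r(X(s), v_1(s), \hat v_2(X(s)))\,ds \right]
   + E_x\bigl[ \psi(X(t)) \bigr] - \psi(x) , \qquad t \ge 0 . $$

Dividing by $t$ and taking limits again using [2, Lemma 3.7.2, p. 125], we obtain

$$ \varrho \ge \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), \hat v_2(X(s)))\,ds \right] . $$

Since $v_1 \in \mathcal{U}_1$ was arbitrary, we have

$$ \begin{aligned}
\varrho &\ge \sup_{v_1 \in \mathcal{U}_1} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), \hat v_2(X(s)))\,ds \right] \\
        &\ge \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), v_2(s))\,ds \right] .
\end{aligned} \tag{2.24} $$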



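The pair $(\varrho, \psi)$ also satisfies the p.d.e.

$$ \varrho = \min_{v_2 \in V_2} \bigl[ L\psi(x, \hat v_1(x), v_2) + r(x, \hat v_1(x), v_2) \bigr] . $$

Let $v_2 \in \mathcal{U}_2$ and $X$ be the process in (2.1) corresponding to $(\hat v_1, v_2)$ and initial condition
$x \in \mathbb{R}^d$. By applying the Itô-Dynkin formula, we obtain

$$ E_x\bigl[ \psi(X(t \wedge \tau_R)) \bigr] - \psi(x)
   \ge - E_x\left[ \int_0^{t \wedge \tau_R} \bigl( r(X(s), \hat v_1(X(s)), v_2(s)) - \varrho \bigr)\,ds \right] . $$

Hence

$$ \varrho\, E_x[t \wedge \tau_R] \le E_x\left[ \int_0^{t \wedge \tau_R} r(X(s), \hat v_1(X(s)), v_2(s))\,ds \right]
   + E_x\bigl[ \psi(X(t \wedge \tau_R)) \bigr] - \psi(x) . $$

Next, by letting $R \to \infty$ and using the dominated convergence theorem for the l.h.s. and
[2, Lemma 3.7.2, p. 125] for the r.h.s., we obtain

$$ \varrho\, t \le E_x\left[ \int_0^{t} r(X(s), \hat v_1(X(s)), v_2(s))\,ds \right]
   + E_x\bigl[ \psi(X(t)) \bigr] - \psi(x) . $$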
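Also by [2, Lemma 3.7.2, p. 125], we obtain

$$ \varrho \le \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), \hat v_1(X(s)), v_2(s))\,ds \right] . $$

Since $v_2 \in \mathcal{U}_2$ was arbitrary, we have

$$ \begin{aligned}
\varrho &\le \inf_{v_2 \in \mathcal{U}_2} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), \hat v_1(X(s)), v_2(s))\,ds \right] \\
        &\le \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), v_2(s))\,ds \right] .
\end{aligned} \tag{2.25} $$

Combining (2.24) and (2.25), we obtain

$$ \inf_{v_2 \in \mathcal{U}_2} \sup_{v_1 \in \mathcal{U}_1} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), v_2(s))\,ds \right]
   = \sup_{v_1 \in \mathcal{U}_1} \inf_{v_2 \in \mathcal{U}_2} \liminf_{t \to \infty} \frac{1}{t}\, E_x\left[ \int_0^{t} r(X(s), v_1(s), v_2(s))\,ds \right] , $$

i.e., $\varrho = \beta$, the value of the game. This completes the proof. □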

Remark 2.2. Using Theorem 2.2 one can easily prove that any pair of measurable outer
maximizing and outer minimizing selectors of (2.13) is a saddle point equilibrium for the
stochastic differential game with state dynamics given by (2.1).



3. Relative Value Iteration

We consider the following relative value iteration equation.

$$ \frac{\partial\varphi}{\partial t}(t, x) = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\varphi(t, x, v_1, v_2) + r(x, v_1, v_2) \bigr] - \varphi(t, 0) , \qquad
   \varphi(0, x) = \varphi_0(x) , \tag{3.1} $$

where $\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$. This can be viewed as a continuous time continuous state
space variant of the relative value iteration algorithm for Markov decision processes [?].
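For orientation, the discrete analogue can be sketched for a finite-state, single-controller Markov decision process with an average-cost criterion. This is only an illustration of the idea of subtracting the value at a reference state, mirroring the offset $\varphi(t, 0)$ in (3.1); it is not the scheme analyzed in this paper, and the two-state cost and transition data below are hypothetical.

```python
def bellman(h, cost, trans):
    """Apply (T h)(s) = min_a [ cost[s][a] + sum_j trans[s][a][j] * h[j] ]."""
    return [
        min(
            c + sum(p * hj for p, hj in zip(row, h))
            for c, row in zip(cost[s], trans[s])
        )
        for s in range(len(h))
    ]


def relative_value_iteration(cost, trans, ref=0, iters=2000):
    """Iterate h <- T h - h[ref]. At a fixed point, h[ref] is the optimal
    average cost and h - h[ref] is a relative value function."""
    h = [0.0] * len(cost)
    for _ in range(iters):
        th = bellman(h, cost, trans)
        offset = h[ref]  # value at the reference state, like phi(t, 0) in (3.1)
        h = [v - offset for v in th]
    return h


# Hypothetical data: cost[s][a] and trans[s][a][j] for 2 states and 2 actions.
COST = [[1.0, 2.0], [3.0, 0.5]]
TRANS = [
    [[0.5, 0.5], [0.9, 0.1]],
    [[0.5, 0.5], [0.2, 0.8]],
]

if __name__ == "__main__":
    h = relative_value_iteration(COST, TRANS)
    print("estimated optimal average cost:", h[0])
```

Because every transition probability in this example is positive, the iteration contracts in the span seminorm and converges. Comparing all four deterministic stationary policies by their stationary distributions gives an optimal average cost of 9/14, which is the value the reference entry `h[0]` approaches.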

Convergence of this relative value iteration scheme is obtained through the study of the
value iteration equation which takes the form

$$ \frac{\partial\bar\varphi}{\partial t}(t, x) = \min_{v_2 \in V_2} \max_{v_1 \in V_1} \bigl[ L\bar\varphi(t, x, v_1, v_2) + r(x, v_1, v_2) \bigr] - \beta , \qquad
   \bar\varphi(0, x) = \varphi_0(x) , \tag{3.2} $$

where $\beta$ is the value of the average payoff game in Theorem 2.2.

Under Assumption (A3), it is straightforward to show that for each $T > 0$ there exists a
unique solution $\bar\varphi$ in $C_{\mathcal{V}}([0, T] \times \mathbb{R}^d) \cap C^{1,2}([0, T] \times \mathbb{R}^d)$ to the p.d.e. (3.2).

First, we prove the following important estimate which is crucial for the proof of
convergence.

Lemma 3.1. Assume (A1)-(A3). Then for each $T > 0$, the p.d.e. in (3.1) has a unique
solution $\varphi \in C_{\mathcal{V}}([0, T] \times \mathbb{R}^d) \cap C^{1,2}([0, T] \times \mathbb{R}^d)$.
Proof. The proof follows by mimicking the arguments in [1, Lemma 4.1], using the following
estimate

$$ E_x[\mathcal{V}(X(t))] \le \frac{k_0}{2k_1} + \mathcal{V}(x)\, e^{-2k_1 t} , \tag{3.3} $$

where $X$ is the solution to (2.1) corresponding to any admissible controls $v_1$ and $v_2$ and
initial condition $x \in \mathbb{R}^d$. The estimate for $\varphi$ follows from the arguments in [2, Lemma 2.5.5,
pp. 63-64], noting that for all $v_i \in \mathcal{U}_i$, $i = 1, 2$, we have

$$ \int_0^t E_x\bigl[ r_n(X(s), v_1(s), v_2(s)) \bigr]\,ds \le c \int_0^t E_x[\mathcal{V}(X(s))]\,ds
   \le \frac{c}{2k_1} \bigl( k_0 t + \mathcal{V}(x) \bigr) , $$

where $r_n(x, v_1, v_2) := n \wedge r(x, v_1, v_2)$ is the truncation of $r$ at $n \ge 0$. □
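The estimate (3.3) is the usual consequence of the drift inequality in (A3). The following is a formal sketch of that step, assuming enough integrability to take expectations in Itô's formula directly; a rigorous argument would localize with the exit times $\tau_R$ and pass to the limit, which is omitted here.

```latex
% Formal sketch: take expectations in Ito's formula for \mathcal{V}(X(t)) and use (A3).
\begin{align*}
\frac{d}{dt}\,E_x[\mathcal{V}(X(t))]
  &\le k_0 - 2k_1\,E_x[\mathcal{V}(X(t))], \\
\frac{d}{dt}\Bigl(e^{2k_1 t}\,E_x[\mathcal{V}(X(t))]\Bigr)
  &\le k_0\,e^{2k_1 t}, \\
E_x[\mathcal{V}(X(t))]
  &\le \mathcal{V}(x)\,e^{-2k_1 t} + \frac{k_0}{2k_1}\bigl(1 - e^{-2k_1 t}\bigr)
   \le \frac{k_0}{2k_1} + \mathcal{V}(x)\,e^{-2k_1 t}.
\end{align*}
```

Integrating the last bound over $[0, t]$ gives $\int_0^t E_x[\mathcal{V}(X(s))]\,ds \le (k_0 t + \mathcal{V}(x))/(2k_1)$, which is the second inequality used in the proof of Lemma 3.1.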

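Next, we turn our attention to the p.d.e. in (3.2).

Lemma 3.2. Assume (A1)-(A3). For each $\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$, the solution $\bar\varphi$ of the
p.d.e. (3.2) satisfies the following estimate

$$ |\varphi^*(x) - \bar\varphi(t, x)| \le \|\varphi^* - \varphi_0\|_{\mathcal{V}} \left( \frac{k_0}{2k_1} + \mathcal{V}(x) \right)
   \qquad \text{for all } x \in \mathbb{R}^d ,\ t \ge 0 . $$

Proof. Let $v_1^* \in \mathcal{M}_1$ and $v_2^* \in \mathcal{M}_2$ be an outer maximizing and outer minimizing selector of
(2.13), respectively. Also let $\bar v_1(\cdot, \cdot)$ and $\bar v_2(\cdot, \cdot)$, respectively, be an outer maximizing and
an outer minimizing selector of (3.2). By applying Itô's formula to $\varphi^* - \bar\varphi$, we obtain

$$ E_x\bigl[ \varphi^*(X^*(t)) - \varphi_0(X^*(t)) \bigr] \le \varphi^*(x) - \bar\varphi(t, x)
   \le E_x\bigl[ \varphi^*(\bar X(t)) - \varphi_0(\bar X(t)) \bigr] , $$

where $X^*$ and $\bar X$ are the solutions to (2.1) corresponding to the control pairs $(\bar v_1, v_2^*)$ and
$(v_1^*, \bar v_2)$, respectively, for the initial condition $x \in \mathbb{R}^d$. An application of (3.3) completes the proof. □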

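Arguing as in the proof of [1, Lemma 4.4], we can show the following:

Lemma 3.3. Assume (A1)-(A3). If $\varphi(0, x) = \bar\varphi(0, x) = \varphi_0(x)$ for some
$\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$, then

$$ \varphi(t, x) - \varphi(t, 0) = \bar\varphi(t, x) - \bar\varphi(t, 0) , $$

$$ \varphi(t, x) = \bar\varphi(t, x) - e^{-t} \int_0^t e^{s}\, \bar\varphi(s, 0)\,ds + \beta\bigl( 1 - e^{-t} \bigr) $$

for all $x \in \mathbb{R}^d$ and $t \ge 0$.

Convergence of the relative value iteration is asserted in the following theorem.

Theorem 3.1. Assume (A1)-(A3). For each $\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$, $\varphi(t, x)$ converges to
$\varphi^*(x) + \beta$ as $t \to \infty$.

Proof. By closely mimicking the arguments of [1, Theorem 4.5], we can show that
$\bar\varphi(t, x) \to \varphi^*(x) + C_0$ as $t \to \infty$ for some constant $C_0 \in \mathbb{R}$ which depends on $\varphi_0$. By Lemma 3.3 we
have

$$ \varphi(t, x) = \bar\varphi(t, x) + \int_0^t e^{s - t} \bigl( \beta - \bar\varphi(s, 0) \bigr)\,ds . $$

Hence $\varphi(t, x) \to \varphi^*(x) + \beta$ as $t \to \infty$. □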
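The identities in Lemma 3.3 and the limit in Theorem 3.1 can be checked by a short computation. The following is a sketch rather than the argument of [1]: it assumes that the solution of (3.1) is unique in the class considered, so that it suffices to verify that the candidate below solves (3.1).

```latex
% Ansatz: \varphi(t,x) = \bar\varphi(t,x) + g(t) with g(0) = 0.
% The operator in (3.1)-(3.2) differentiates only in x, so it does not see g(t).
\begin{align*}
\partial_t\bar\varphi(t,x) + g'(t)
  &= \min_{v_2}\max_{v_1}\bigl[L\bar\varphi(t,x,v_1,v_2) + r(x,v_1,v_2)\bigr]
     - \bar\varphi(t,0) - g(t), \\
g'(t) + g(t) &= \beta - \bar\varphi(t,0)
  && \text{(subtract (3.2))}, \\
g(t) &= \int_0^t e^{s-t}\bigl(\beta - \bar\varphi(s,0)\bigr)\,ds
      = \beta\bigl(1 - e^{-t}\bigr) - e^{-t}\int_0^t e^{s}\,\bar\varphi(s,0)\,ds .
\end{align*}
% If \bar\varphi(t,x) \to \varphi^*(x) + C_0 and \varphi^*(0) = 0, then
% g(t) \to \beta - C_0, hence \varphi(t,x) \to \varphi^*(x) + \beta.
```

In particular the unknown constant $C_0$ cancels in the limit, which is why the relative value iteration, unlike the value iteration (3.2), has a limit that does not depend on the initial condition $\varphi_0$.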
4. Risk-Sensitive Control

In this section, we apply the results from Section 3 to study the convergence of a relative
value iteration scheme for the risk-sensitive control problem which is described as follows.
Let $U$ be a compact metric space and $V = \mathcal{P}(U)$ denote the space of all probability
measures on $U$ with the Prohorov topology. We consider the risk-sensitive control problem
with state equation given by the controlled s.d.e. (in relaxed form)

$$ dX(t) = b(X(t), v(t))\,dt + \sigma(X(t))\,dW(t) , \tag{4.1} $$

and payoff criterion

$$ J(x, v) := \liminf_{T \to \infty} \frac{1}{T} \ln E\left[ \exp\left( \int_0^T r(X(t), v(t))\,dt \right) \,\Big|\, X(0) = x \right] . $$

All processes in (4.1) are defined in a common probability space $(\Omega, \mathfrak{F}, P)$ which is
assumed to be complete. The process $W$ is an $\mathbb{R}^d$-valued standard Wiener process which is
independent of the initial condition $X_0$ of (4.1). The control $v$ is a $V$-valued process which
is jointly measurable in $(t, \omega) \in [0, \infty) \times \Omega$ and non-anticipative, i.e., for $s < t$, $W(t) - W(s)$
is independent of $\mathfrak{F}_s :=$ the completion of $\sigma(X_0, v(r), W(r),\ r \le s)$. We denote the set of
all such controls (admissible controls) by $\mathcal{U}$.

Assumptions on the Data: We assume the following properties for the coefficients $b$ and $\sigma$:

(B1) The functions $b$ and $\sigma$ are continuous and bounded, and also Lipschitz continuous
in $x \in \mathbb{R}^d$ uniformly over $v \in V$. Also $(\sigma\sigma^T)^{-1}$ is Lipschitz continuous.

(B2) For each $R > 0$ there exists a constant $k(R) > 0$ such that

$$ z^T a(x) z \ge k(R)\,\|z\|^2 \quad \text{for all } \|x\| \le R \text{ and } z \in \mathbb{R}^d , $$

where $a := \sigma\sigma^T$.

Asymptotic Flatness Hypothesis: We assume the following property:

(B3) (i) There exists a $c > 0$ and a positive definite matrix $Q$ such that for all $x, y \in \mathbb{R}^d$
with $x \ne y$, we have

$$ 2\bigl( b(x, v) - b(y, v) \bigr)^T Q (x - y) + \operatorname{tr}\bigl( (\sigma(x) - \sigma(y)) (\sigma(x) - \sigma(y))^T Q \bigr)
   - \frac{\bigl\| (\sigma(x) - \sigma(y))^T Q (x - y) \bigr\|^2}{(x - y)^T Q (x - y)} \le -c\, \|x - y\|^2 . $$

(ii) Let $\mathrm{Lip}(f)$ denote the Lipschitz constant of a Lipschitz continuous function $f$.
Then

$$ 2\, \|\sigma\sigma^T\|_\infty\, \mathrm{Lip}(r)\, \mathrm{Lip}\bigl( (\sigma\sigma^T)^{-1} \bigr) < c^2 . $$
We quote the following result from [5, Theorems 2.2-2.3]:

Theorem 4.1. Assume (B1)-(B3). The p.d.e.

$$ \begin{aligned}
\beta &= \min_{v \in V} \max_{w \in \mathbb{R}^d} \bigl[ L\varphi^*(x, w, v) + r(x, v) - \tfrac12\, w^T a^{-1}(x)\, w \bigr] \\
      &= \max_{w \in \mathbb{R}^d} \min_{v \in V} \bigl[ L\varphi^*(x, w, v) + r(x, v) - \tfrac12\, w^T a^{-1}(x)\, w \bigr] , \qquad \varphi^*(0) = 0 ,
\end{aligned} \tag{4.2} $$

where

$$ L f(x, w, v) := \bigl( b(x, v) + w \bigr) \cdot \nabla f(x) + \tfrac12 \operatorname{tr}\bigl( a(x) \nabla^2 f(x) \bigr) , \qquad f \in C^2(\mathbb{R}^d) , $$

has a unique solution $(\beta, \varphi^*) \in \mathbb{R} \times \bigl( C^2(\mathbb{R}^d) \cap o(\|x\|) \bigr)$. Moreover, $\beta$ is the value of the
risk-sensitive control problem and any measurable outer minimizing selector in (4.2) is
risk-sensitive optimal. Also in (4.2), the supremum can be restricted to a closed ball $\bar B_R$
for

Lip(r) Lip^T)- 1 )^ 2 
R '-^ + Wc ' 

where K is the smallest positive root (using (B3) (ii)^ of 

^ ||a<7 T || 00 Lip(( < 7<7 T )- 1 ) x 2 - c 5 / 4 x + Lip(r)||^ T || 00 = 0. 



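For the stochastic differential game in (4.2) we consider the following relative value
iteration equation:

$$ \frac{\partial\varphi}{\partial t}(t, x) = \min_{v \in V} \max_{w \in \mathbb{R}^d} \bigl[ L\varphi(t, x, w, v) + r(x, v) - \tfrac12\, w^T a^{-1}(x)\, w \bigr] - \varphi(t, 0) , \qquad
   \varphi(0, x) = \varphi_0(x) , $$

where $\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$ with

$$ \mathcal{V}(x) = \frac{(x^T Q x)^{1 + \alpha}}{\varepsilon + (x^T Q x)^{1/2}} , $$

for some positive constants $\varepsilon$ and $\alpha$. Here note that Assumption (B3) implies
Assumption (A3) of Section 2 for the Lyapunov function $\mathcal{V}$ given above, see [2, equation (7.3.6),
p. 257].

By Theorems 3.1 and 4.1 the following holds.

Theorem 4.2. Assume (B1)-(B3). For each $\varphi_0 \in C_{\mathcal{V}}(\mathbb{R}^d) \cap C^2(\mathbb{R}^d)$, $\varphi(t, x)$ converges to
$\varphi^*(x) + \beta$ as $t \to \infty$.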



The relative value iteration equation for the risk-sensitive control problem is given by

$$ \frac{\partial\psi}{\partial t}(t, x) = \min_{v \in V} \bigl[ L\psi(t, x, v) + \bigl( r(x, v) - \ln\psi(t, 0) \bigr)\, \psi(t, x) \bigr] , \qquad
   \psi(0, x) = \psi_0(x) , \tag{4.3} $$

where

$$ L f(x, v) := b(x, v) \cdot \nabla f(x) + \tfrac12 \operatorname{tr}\bigl( a(x) \nabla^2 f(x) \bigr) , \qquad f \in C^2(\mathbb{R}^d) . $$

That one has $\ln\psi(t, 0)$ instead of $\psi(t, 0)$ as the 'offset' is only natural, because we are
trying to approximate the logarithmic growth rate of the cost. We have the following
theorem:

Theorem 4.3. Let tp* be the unique solution in the class of functions which grow no faster 
than e"^" of the HJB equation for the risk-sensitive control problem given by 

$$ \beta\psi^* = \min_{v \in V} \bigl[ L\psi^*(x, v) + r(x, v)\, \psi^* \bigr] , \qquad \psi^*(0) = 1 . $$

Under assumptions (B1)-(B3) the solution $\psi(t, x)$ of the relative value iteration in (4.3)
converges as $t \to \infty$ to $e^{\beta}\psi^*(x)$ where $\beta$ is the value of the risk-sensitive control problem
given in Theorem 4.1.

Proof. A straightforward calculation shows that $\psi^* = e^{\varphi^*}$, where $\varphi^*$ is given in Theorem 4.1.
Then it easily follows that $\psi(t, x) = e^{\varphi(t, x)}$, where $\varphi$ is the solution of the relative value
iteration for the stochastic differential game in (4.2). From Theorem 4.2 it follows that
$\psi(t, x) \to e^{\beta}\psi^*(x)$ as $t \to \infty$, which establishes the claim. □
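The "straightforward calculation" in the proof is the standard logarithmic transformation. It is sketched below for a generic positive function $\psi = e^{\varphi}$ of class $C^2$, with $a = \sigma\sigma^T$ invertible as in (B1)-(B2); the sketch does not address the growth classes in which uniqueness holds.

```latex
% With \psi = e^{\varphi}:  \nabla\psi = \psi\,\nabla\varphi,
% \nabla^2\psi = \psi\,(\nabla^2\varphi + \nabla\varphi\,\nabla\varphi^{T}).
\begin{align*}
\frac{L\psi(x,v) + r(x,v)\,\psi(x)}{\psi(x)}
  &= b(x,v)\cdot\nabla\varphi + \tfrac12\operatorname{tr}\bigl(a\nabla^2\varphi\bigr)
     + \tfrac12\,\nabla\varphi^{T} a\,\nabla\varphi + r(x,v), \\
\max_{w\in\mathbb{R}^d}\bigl[w\cdot\nabla\varphi - \tfrac12\,w^{T}a^{-1}w\bigr]
  &= \tfrac12\,\nabla\varphi^{T} a\,\nabla\varphi
  && \text{(attained at } w = a\nabla\varphi\text{)}.
\end{align*}
% Taking the minimum over v, the HJB equation for \psi^* = e^{\varphi^*} becomes (4.2), and
% \partial_t\psi = \psi\,\partial_t\varphi turns the offset \varphi(t,0) into \ln\psi(t,0).
```

The quadratic penalty $\tfrac12 w^T a^{-1} w$ in (4.2) is thus exactly the term that reproduces the gradient-squared contribution of the exponential transformation, which is why the risk-sensitive problem can be read as a zero-sum game with an auxiliary maximizing player $w$.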



5. Acknowledgement 

The work of Ari Arapostathis was supported in part by the Office of Naval Research 
under the Electric Ship Research and Development Consortium. The work of Vivek Borkar 
was supported in part by Grant $dlIRCCSG014 from IRCC, IIT, Mumbai. 



References 

[1] A. Arapostathis and V. S. Borkar, A relative value iteration algorithm for non-degenerate controlled
diffusions, SIAM J. Control Optim., vol. 50, No. 4, pp. 1886-1902, 2012.
[2] A. Arapostathis, V. S. Borkar and M. K. Ghosh, Ergodic Control of Diffusion Processes, Encyclopedia
of Mathematics and its Applications 143, Cambridge University Press, Cambridge, UK, 2012.
[3] V. E. Beneš, Existence of optimal strategies based on a specified information, for a class of stochastic
decision problems, SIAM J. Control, vol. 8, pp. 179-188, 1970.
[4] V. S. Borkar and M. K. Ghosh, Stochastic differential games: Occupation measure based approach,
J. Optim. Theory Appl., vol. 73, No. 2, pp. 359-385, 1992, Errata: vol. 88, No. 1, pp. 251-252, 1996.
[5] V. S. Borkar and K. Suresh Kumar, Singular perturbations in risk-sensitive stochastic control, SIAM J.
Control Optim., vol. 48, No. 6, pp. 3675-3697, 2010.
[6] D. Gilbarg and N. S. Trudinger, Elliptic Partial Differential Equations of Second Order, Classics in
Mathematics, Springer, Reprint of the 1998 Edition.
[7] D. J. White, Dynamic programming, Markov chains, and the method of successive approximations, J.
Math. Anal. Appl., vol. 6, No. 3, pp. 373-376, 1963.




Department of Electrical and Computer Engineering, The University of Texas at Austin, 
1 University Station, Austin, TX 78712 
E-mail address: ari@mail.utexas.edu 

Department of Electrical Engineering, Indian Institute of Technology, Powai, Mumbai 
400076, India 

E-mail address: borkar.vs@gmail.com 



Department of Mathematics, Indian Institute of Technology, Powai, Mumbai 400076, India 
E-mail address: suresh@math.iitb.ac.in



